Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Multi-stage] Support is_enable_group_trim agg option #14664

Merged
merged 1 commit into from
Dec 31, 2024

Conversation

Jackie-Jiang
Copy link
Contributor

@Jackie-Jiang Jackie-Jiang commented Dec 16, 2024

Introduce is_enable_group_trim as an agg option to set limit and optional sort keys into aggregate node so that group trim can be applied in the leaf stage.

E.g.

SELECT /*+ aggOptions(is_partitioned_by_group_by_keys='true', is_enable_group_trim='true') */ num, COUNT(*), SUM(val), SUM(num), COUNT(DISTINCT val) FROM myTable WHERE val >= 0 AND name != 'a' GROUP BY num ORDER BY COUNT(*) DESC, num LIMIT 1

@Jackie-Jiang Jackie-Jiang added enhancement documentation Configuration Config changes (addition/deletion/change in behavior) multi-stage Related to the multi-stage query engine labels Dec 16, 2024
@codecov-commenter
Copy link

codecov-commenter commented Dec 16, 2024

Codecov Report

Attention: Patch coverage is 76.80000% with 29 lines in your changes missing coverage. Please review.

Project coverage is 63.86%. Comparing base (59551e4) to head (92a4934).
Report is 1520 commits behind head on master.

Files with missing lines Patch % Lines
...el/rules/PinotAggregateExchangeNodeInsertRule.java 76.62% 8 Missing and 10 partials ⚠️
...he/pinot/query/planner/explain/PlanNodeMerger.java 0.00% 4 Missing ⚠️
.../query/planner/logical/EquivalentStagesFinder.java 0.00% 3 Missing ⚠️
...he/pinot/query/planner/plannode/AggregateNode.java 66.66% 0 Missing and 3 partials ⚠️
.../runtime/plan/server/ServerPlanRequestVisitor.java 94.44% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #14664      +/-   ##
============================================
+ Coverage     61.75%   63.86%   +2.11%     
- Complexity      207     1607    +1400     
============================================
  Files          2436     2703     +267     
  Lines        133233   150717   +17484     
  Branches      20636    23290    +2654     
============================================
+ Hits          82274    96257   +13983     
- Misses        44911    47253    +2342     
- Partials       6048     7207    +1159     
Flag Coverage Δ
custom-integration1 100.00% <ø> (+99.99%) ⬆️
integration 100.00% <ø> (+99.99%) ⬆️
integration1 100.00% <ø> (+99.99%) ⬆️
integration2 0.00% <ø> (ø)
java-11 63.78% <76.80%> (+2.07%) ⬆️
java-21 63.76% <76.80%> (+2.14%) ⬆️
skip-bytebuffers-false 63.80% <76.80%> (+2.05%) ⬆️
skip-bytebuffers-true 63.73% <76.80%> (+36.01%) ⬆️
temurin 63.86% <76.80%> (+2.11%) ⬆️
unittests 63.86% <76.80%> (+2.11%) ⬆️
unittests1 56.23% <76.80%> (+9.34%) ⬆️
unittests2 34.22% <7.20%> (+6.48%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@Jackie-Jiang Jackie-Jiang force-pushed the aggregate_limit_push_down branch 2 times, most recently from e6d0b43 to 7c21d95 Compare December 16, 2024 19:26
@gortiz gortiz requested review from yashmayya and gortiz December 18, 2024 07:23
@gortiz
Copy link
Contributor

gortiz commented Dec 18, 2024

cc @bziobrowski

List<RexNode> projects = projectRel.getProjects();
List<RelFieldCollation> collations = sortRel.getCollation().getFieldCollations();
if (collations.isEmpty()) {
// Cannot enable group trim without sort key.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand this is required for within segment trimming right? Is it an actual requirement for cross segment trimming?

Anyway, I'm not sure if we should make this kind of decisions here. Couldn't we populate the fields and let the actual operator decide whether an optimization can be applied or not?

More specifically: Don't you think it may be interesting to keep the limit even if we don't have collations?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having non-empty collation is required for trimming within segment and cross-segment but it could be beneficial to propagate limit even if collation is missing.
That is because when limit is present (!= Integer.MAX_VALUE) combine operator might limit the number of group by keys in indexed table.

Copy link
Contributor Author

@Jackie-Jiang Jackie-Jiang Dec 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Limit can be pushed down with or without sorting. Initially I decided to not allow pushing down limit without sorting because it can cause wrong result for aggregate group-by, but realized this can prevent distinct limit to be pushed down. Since this is not enabled by default (requires hint), I'll change it to not check collation.

Comment on lines +88 to +112
public class PinotAggregateExchangeNodeInsertRule {

public static class SortProjectAggregate extends RelOptRule {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although the javadoc in the main class applies to all rules, it would be cool to add a javadoc for each rule to explain what they do (and maybe add an example)

}

@Override
public void onMatch(RelOptRuleCall call) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

Calcite we have onMatch and matches methods. This method seems to be doing both. I think it would be better to override matches to do not apply the rule in all the cases that are filtered out here.

For the final query is the same and we would need to repeat some code, but the advantage of using matches is that if try to debug which rules apply (using MarkerFilter, as explained in https://youtu.be/_phzRNCWJfw?si=hH9ukXAS2Iml11nq&t=331).

I've just created #14680 to do not forget how to enable these logs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some Calcite rules also rely on return in onMatch() if the check is not trivial to reduce overhead, e.g. ProjectToSemiJoinRule. I'm not sure if it is a good idea to rely on which method is being invoked. IMO a safer way to detect which rules are applied is to check the actual rewrite of RelNode

@@ -249,6 +249,39 @@
"\n LogicalTableScan(table=[[default, a]])",
"\n"
]
},
Copy link
Contributor

@bziobrowski bziobrowski Dec 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this PR meant to enable trimming within (minSegmentTrim) and across segments (minServerTrim) ?
If so - It'd be good to mention that :

  • the former is disabled by default and requires using query option (or cluster setting)
  • the latter is enabled by default but probably haven't been applied so far

I reckon It'd be good to also test this hint has on a query without order by and/or limit clause.

Apart from manual hinting I think we could propagate limit and order by details to leaf plans and enable both types of trimming if:

  • order by is based on group by key(s) only (not aggregates)
  • there is no HAVING clause

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leaf stage has both segment trim and server trim. Will add some javadoc about the behavior, and we should consider adding them also as v2 hint. Right now the behavior relies on the v1 query options.

Agree on automating it when the result is guaranteed to be accurate. This will also require some v1 engine change, so will do it separately.

@@ -321,6 +321,10 @@
"description": "aggregate with skip intermediate stage hint (via hint option is_partitioned_by_group_by_keys)",
"sql": "SELECT /*+ aggOptions(is_partitioned_by_group_by_keys='true') */ {tbl1}.num, COUNT(*), SUM({tbl1}.val), SUM({tbl1}.num), COUNT(DISTINCT {tbl1}.val) FROM {tbl1} WHERE {tbl1}.val >= 0 AND {tbl1}.name != 'a' GROUP BY {tbl1}.num"
},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a test asserting that trimming affects results ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can probably make one, but that will cause wrong (and inconsistent) result, so not sure how to test it. I've manually tested it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I already added some tests in #14727.
It was tricky because in many cases trimming makes result unstable.

@Jackie-Jiang Jackie-Jiang force-pushed the aggregate_limit_push_down branch from 7c21d95 to aaf5a05 Compare December 31, 2024 03:21
@Jackie-Jiang Jackie-Jiang changed the title [Multi-stage] Support is_enable_group_trim agg option [Multi-stage] Support is_push_down_limit agg option Dec 31, 2024
@Jackie-Jiang Jackie-Jiang force-pushed the aggregate_limit_push_down branch from aaf5a05 to 92a4934 Compare December 31, 2024 05:44
@Jackie-Jiang Jackie-Jiang changed the title [Multi-stage] Support is_push_down_limit agg option [Multi-stage] Support is_enable_group_trim agg option Dec 31, 2024
@Jackie-Jiang Jackie-Jiang merged commit f98b250 into apache:master Dec 31, 2024
21 checks passed
@Jackie-Jiang Jackie-Jiang deleted the aggregate_limit_push_down branch December 31, 2024 07:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Configuration Config changes (addition/deletion/change in behavior) documentation enhancement multi-stage Related to the multi-stage query engine
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants